Abstract

We present the first release of the HeteroGenome database collecting latent periodicity 
regions in genomes. Tandem repeats and highly divergent tandem repeats along with 
the regions of a new type of periodicity, known as profile periodicity, have been collected 
for the genomes of Saccharomyces cerevisiae, Arabidopsis thaliana, Caenorhabditis ele- 
gans and Drosophila melanogaster. We obtained data with the aid of a spectral-statistical 
approach to search for reliable latent periodicity regions (with periods up to 2000 bp) in 
DNA sequences. The original two-level mode of data presentation (a broad view of the 
region of latent periodicity and a second level indicating conservative fragments of its 
structure) was further developed to enable us to obtain the estimate, without redundancy,
that latent periodicity regions make up ~10% of the analyzed genomes. Analysis
of the quantitative and qualitative content of located periodicity regions on all chromo- 
somes of the analyzed organisms revealed dominant characteristic types of periodicity in 
the genomes. The pattern of density distribution of latent periodicity regions on a
chromosome unambiguously characterizes each chromosome in the genome.
Database URL: http://www.jcbi.ru/lp_baze/ 



Introduction 

Periodicity in genomes 

Periodicity regions of tandem repeats (arrays of sequen- 
tially repeated copies of original DNA sequence fragments, 
or patterns) in genomes have long been of interest for two 
main reasons: to understand the molecular mechanisms of 
the origin and evolution of the repeats and their functional
role in genomes, and to determine potential new markers
for population and evolutionary genetics. Alterations in 
pattern copies (substitution of original nucleotides, inser- 
tions and deletions of nucleotides) lead to the formation of 
approximate tandem repeats. Repeats whose alterations
are limited to substitutions of nucleotide bases (bp) are
commonly called fuzzy tandem repeats. Approximate
tandem repeats (including fuzzy repeats) are regions of
latent periodicity in a genome.

The most studied groups of tandem repeats in genomes 
are microsatellites (patterns of ~10bp) and minisatellites 
(patterns of ~100bp) because of their use as genetic 
markers in forensics, parentage assessment, positional 
cloning and population and evolutionary genetics (1). 

Microsatellite tandem repeats are generally believed to 
arise by replication slippage (2). In contrast to microsatel- 
lites, DNA recombination-like processes that involve 
unequal crossover or gene conversion display mutation 
mechanisms in the larger minisatellite sequences (3). 
Tandem repeats can arise by consecutive gene duplication 
as a result of homologous chromosome recombination 
during meiosis (4). 

Numerous tandem repeats are situated in centromeric 
and telomeric regions of chromosomes (5). Tandem repeats 
have been found at fragile chromosome sites (6, 7). In 
some cases, the expansion of triplet microsatellites at fra- 
gile sites caused human mental retardation (8). Human 
neurological disorders may also be induced by 'dynamic' 
mutations (reduction or increase in the pattern copy num- 
ber of a tandem repeat) in both coding regions and non- 
coding regions; such mutations are not limited to occurring 
with triplet microsatellites (9-11). Tandem repeats situated 
outside the boundaries of coding regions influence the ex- 
pression of genes and the processes of transcription and 
translation (12). 

Periodicity software and databases 

A number of computer programs have been developed to 
search for perfect or nearly perfect tandem repeats, includ- 
ing the programs TRF (13), ACMES (14), MREPATT 
(15), STRING (16), mreps (17) and ATRHunter (18). 
These programs are based on various algorithms, and 
whether their results agree is dependent on the period 
length, copy number and divergence of a repeat. 

Over the past decade, programs have also been de- 
veloped to search for increasingly divergent tandem repeats 
with the aim of studying their evolutionary aspects, includ- 
ing TandemSWAN (19), IMEX (20), TRStalker (21), and a 
program based on a model of evolutive tandem repeats 
(22). Whereas perfect (or nearly perfect) tandem repeats 
vary by copy number and are dynamic because of strand 
slippage during DNA replication, highly divergent repeats 
are relatively stable structural genomic elements, but their 
functional role has so far been insufficiently studied. 

Search results for tandem repeats are collected in vari- 
ous resources such as the TRedD (23) database of tandem 
repeats, determined in the human genome with the aid of 
an algorithm described in (22). The best-known database,
TRDB (24), contains tandem repeats revealed via the
Tandem Repeats Finder (TRF) method (13) in entire
sequenced genomes of eukaryotes (including human) and 
prokaryotes. TRbase (25) connects tandem repeats found 
in the human genome using the TRF method (13) with 
gene location on the chromosomes, and in particular high- 
lights those genes whose disorders result in genetic 
diseases. 

It should be noted that proposed heuristic algorithms 
revealing highly divergent approximate tandem repeats do 
not resolve the problem of the reliability of results. To en- 
sure that an approximate tandem repeat has been found, 
additional filtration of the results is usually performed. For 
example, in the program based on a model of evolutive 
tandem repeats (22), the level of pattern copy divergence in 
a repeat is limited to a maximum of ~30%. However, this 
value exceeds the divergence level of 20% required in a 
probability model for the TRF method (13). Moreover, re- 
dundancy of the TRF results (different patterns in the same 
tandem repeat may be proposed) (19) and instability of the 
results (for example, shifting of 1-3 bp along the same 
DNA sequence can lead to different pattern estimates) 
have been reported (26). 

Along with the methods mentioned above, there is also
a statistical method (27, 28) known as the information
decomposition method (ID-method). The authors of the
ID-method created the MMsat database (29) and the
web-server LEPSCAN (30) for searching for latent periodicity
regions with period lengths ranging from 2 to 20 bp.
The practical utility of the MMsat database is restricted to the
latent micro- and minisatellites found in GenBank
with the help of the ID-method.

To estimate the period length of latent periodicity in a
DNA sequence, the authors of the ID-method introduce
Z-statistics, whose properties are based on empirical
samples. The foundation of these Z-statistics is the standard
information statistic (31), which, as shown by
Kullback (31), is applicable only for revealing heterogeneity in an
analyzed sequence. Z-statistics can therefore be used only for the same
purpose. Hence, the empirical properties of Z-statistics do not
guarantee the existence of latent periodicity with unknown
structure in an analyzed sequence. As a consequence, relying on
such properties can lead to incorrect estimates of the period of
latent periodicity. We compare our approach with the ID-method
in more detail in the 'Particularities of SS-approach' section.

In our work, an original spectral-statistical approach 
(SS-approach) (32) was applied to search for reliable la- 
tent periodicity regions in the genomes of Saccharomyces 
cerevisiae, Arabidopsis thaliana, Caenorhabditis elegans 
and Drosophila melanogaster. As previously shown (26, 
32), such an approach prevents nonuniqueness when 
resolving latent periodicity structure in approximate tandem
repeats and optimizes estimation of the periodicity
pattern size.

The SS-approach reveals extremely divergent tandem 
repeats (pattern copy divergence level ~50%) and regions 
of a new type of periodicity known as profile periodicity 
(33, 34). Methods commonly applied to locate approxi- 
mate tandem repeats cannot be used to locate profile 
periodicity. 

Latent periodicity and heterogeneity 

The notion of latent profile periodicity, or latent profility (32),
expands on the notion of an approximate tandem repeat (13),
which was earlier used for recognizing latent periodicity in
DNA sequences. A perfect tandem repeat is a textual string
that consists of sequential copies of its substring, called a
periodicity pattern. In an approximate tandem repeat, a small
number (no more than 30%) of characters in pattern copies
are distorted by indels and nucleotide point mutations.
If latent profile periodicity exists in a DNA sequence, distortion
of the characters in each position of pattern copies
occurs in accordance with the corresponding probability
distribution.

In a model of latent profility, a DNA sequence (textual 
string) is considered as a realization of a special random 
periodic string called a profile string (33, 34). This string, 
consisting of independent random characters, represents a 
perfect tandem repeat of a random string, a so-called ran- 
dom periodicity pattern. This pattern of latent profility 
consists of independent random characters with a corres- 
ponding probability distribution for letters from the DNA 
alphabet. Exploring latent profility is beyond the scope of 
the present article. An example of a sequence with latent 
profility will be discussed further. 
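
The profile-string model described above can be illustrated with a minimal sketch. It assumes a lowercase DNA alphabet `acgt`, and the function name `sample_profile_string` and the example probabilities are hypothetical choices made only for illustration; they are not taken from (33, 34). Each position of the random periodicity pattern has its own probability distribution over the alphabet, and a realization is drawn independently position by position, copy by copy.

```python
import random

ALPHABET = "acgt"  # assumed alphabet ordering for the probability tuples below


def sample_profile_string(profile, copies, rng):
    """Draw one realization of a profile string.

    profile: one probability tuple (over ALPHABET) per pattern position.
    copies:  number of consecutive pattern copies to generate.
    rng:     a random.Random instance, passed in for reproducibility.
    """
    letters = []
    for _ in range(copies):
        for weights in profile:
            # Each character is drawn independently from its position's distribution.
            letters.append(rng.choices(ALPHABET, weights=weights, k=1)[0])
    return "".join(letters)


# Hypothetical random periodicity pattern of length 3 (illustrative numbers only).
example_profile = [
    (0.7, 0.1, 0.1, 0.1),      # position 1 prefers 'a'
    (0.1, 0.1, 0.7, 0.1),      # position 2 prefers 'g'
    (0.25, 0.25, 0.25, 0.25),  # position 3 is uniform
]
realization = sample_profile_string(example_profile, copies=10, rng=random.Random(0))
```

When every position's distribution is concentrated on a single letter, the realization degenerates into a perfect tandem repeat, which is one way to see that profility generalizes the tandem-repeat notion.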

The SS-approach, applied to the model organisms,
made it possible to detect periodicity of two types: approximate
tandem repeats (13) and latent profility (32-34).
Moreover, there are a number of repeat-like regions that
also meet the criterion of significant heterogeneity used by
the SS-approach for revealing latent
periodicity. A general description of the approach is given in the
'SS-approach to revealing latent periodicity in DNA' section.
The approach uses χ²-statistics
for testing homogeneity in a DNA sequence at a significance
level characteristic of approximate tandem repeats, whose
sequences are obviously heterogeneous. However,
significant heterogeneity is a necessary, but insufficient,
condition for determining latent periodicity of the types
mentioned above.

Because the results of the latent periodicity search in automatic
mode are undoubtedly heterogeneous sequences, and some
additional analysis is needed to verify their periodic structure,
for brevity these results are frequently
referred to below as heterogeneities. The name of the database,
HeteroGenome, reflects this particularity of the collected data.
HeteroGenome provides additional tools for
revealing latent periodicity, which are described
in the 'Additional sequence analysis' section.

This article presents the first release of the 
HeteroGenome database, which collects latent periodicity 
regions (with periods up to 2000 bp) revealed by the SS- 
approach (32) in the genomes of various organisms. 
Because of nonredundancy of the results of the approach, 
for each genome under consideration we can estimate its 
coverage as a percentage of latent periodicity regions, and 
with the aid of a special parameter, which is described in 
the next section, we can analyze the conservation of peri- 
odic structure in the regions. Such analysis enables domin- 
ant characteristic types of periodicity to be located in each 
of the genomes. Therefore, HeteroGenome may be useful 
for both functional genomics and evolutionary genomics in 
searching for tandem repeats characteristic of the genome 
and for further research into the phenomenon of latent 
periodicity in DNA sequences. The database has a user-friendly
interface and options for additional data analysis
and allows query results to be downloaded.
HeteroGenome can be freely accessed at http://www.jcbi.ru/lp_baze/.



Materials and methods 

Genomic data 

The first release of the HeteroGenome database was not
aimed at collecting a large amount of data from a large
number of organisms. First and foremost, the data should
serve to explore the scale and character of the latent periodicity
phenomenon in the genomes of living organisms. Therefore,
in our work, emphasis was placed on quantitative and
qualitative analysis of the regions with latent periodicity
revealed by the original SS-approach (26, 32), which is
described in the next section. For this purpose, the
genomes of four well-studied model organisms (35), S. cerevisiae,
A. thaliana, C. elegans and D. melanogaster, were
selected. The genomes of these organisms were among the
first to be sequenced and, from them, moderately accurate
and well-annotated data were obtained. These representative
eukaryotes, ranging from unicellular (yeast) to multicellular
organisms of the plant (Arabidopsis) and animal
(Drosophila, nematode) kingdoms, also facilitate an overview
of latent periodicity in the genome. The original
DNA sequences of the entire genomes of S. cerevisiae, A.
thaliana, C. elegans and D. melanogaster were taken from
ftp://ftp.ncbi.nih.gov/genomes/. Data on chromosome contigs
are shown in Table 1.

Table 1. Analyzed genomes of model organisms (columns: Chromosomes; GenBank identifiers)



SS-approach to revealing latent periodicity 
in DNA 

We used an original SS-approach (26, 32) to reveal latent
periodicity in DNA. The SS-approach reveals latent
periodicity by detecting significant heterogeneity at the
test-periods of an analyzed nucleotide sequence. At each
test-period $\lambda$ the analyzed sequence is divided into
substrings of length $\lambda$ (the final substring may have a smaller
length). If $n$ is the length of the analyzed sequence, then
$R_\lambda = n/\lambda$ is its test-exponent for the test-period $\lambda$.
This division into substrings allows us to calculate the frequency of
occurrence $\pi_j^i$ of the $i$th letter of a nucleotide sequence
alphabet in the $j$th positions of a test-period. Matrix
$\Pi = (\pi_j^i)$, $i = 1, \ldots, K$, $j = 1, \ldots, \lambda$, is the
sample $\lambda$-profile matrix for the analyzed
sequence, where $K$ is the size of the nucleotide sequence
alphabet. Frequency of occurrence $p^i$ of the $i$th letter in
the analyzed sequence is determined by the matrix $\Pi$:

$$p^i = \frac{1}{\lambda} \sum_{j=1}^{\lambda} \pi_j^i \qquad (1)$$

For checking sequence heterogeneity at test-period $\lambda$,
normalized Pearson $\chi^2$-statistics (32) are applied:

$$\Psi(\lambda, n) = R_\lambda \sum_{j=1}^{\lambda} \sum_{i=1}^{K} \frac{(\pi_j^i - p^i)^2}{p^i} \qquad (2)$$
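
As a hedged illustration of Equations (1) and (2), the sketch below computes a sample profile matrix and the normalized Pearson statistic for a lowercase DNA string. It assumes that the entries of the profile matrix are relative frequencies within each position of the test-period (so each position's frequencies sum to 1); the function names `profile_matrix` and `pearson_statistic` are hypothetical and do not come from the authors' program.

```python
from collections import Counter

ALPHABET = "acgt"  # K = 4


def profile_matrix(seq, period):
    """Rows j = positions of the test-period; columns i = letters of ALPHABET.

    Each entry is the relative frequency of letter i among the characters
    that fall on position j of the test-period (assumed normalization).
    """
    if not 1 <= period <= len(seq):
        raise ValueError("period must be between 1 and len(seq)")
    columns = [Counter() for _ in range(period)]
    for pos, ch in enumerate(seq):
        columns[pos % period][ch] += 1
    return [[col[a] / sum(col.values()) for a in ALPHABET] for col in columns]


def pearson_statistic(seq, period):
    """Normalized Pearson statistic of Equation (2) at one test-period."""
    pi = profile_matrix(seq, period)
    # Equation (1): letter frequencies as averages over the period positions.
    p = [sum(row[i] for row in pi) / period for i in range(len(ALPHABET))]
    test_exponent = len(seq) / period  # R = n / period
    total = 0.0
    for row in pi:
        for i, p_i in enumerate(p):
            if p_i > 0:  # letters absent from the sequence contribute nothing
                total += (row[i] - p_i) ** 2 / p_i
    return test_exponent * total
```

For a perfect repeat such as `"acgt" * 10`, the statistic is large at the true period (4) and zero at test-period 1, where the single profile position coincides with the overall letter frequencies.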



Because the search for tandem repeats is conducted in
nucleotide databases that are large in volume, to verify
heterogeneity in DNA sequences a significance level (type-I
error) of $\alpha = 10^{-6}$ was chosen. At a fixed test-period $\lambda$, the
level corresponds to a critical value $\chi^2_{crit}(\alpha, N)$ with
$N = (K-1)(\lambda-1)$ degrees of freedom. Therefore, if $\Psi(\lambda, n)$ for
the analyzed sequence of length $n$ at the test-period $\lambda$ meets
the condition

$$\Psi(\lambda, n) / \chi^2_{crit}(\alpha, (K-1)(\lambda-1)) < 1 \qquad (3)$$

then homogeneity of the sequence is assumed at test-period
$\lambda$; otherwise, the sequence is considered heterogeneous. Thus,
as a spectral characteristic of the analyzed nucleotide
sequence, the following function H is used:

$$H(\lambda) = \Psi(\lambda, n) / \chi^2_{crit}(\alpha, (K-1)(\lambda-1)) \qquad (4)$$



where $\lambda = 1, 2, \ldots, \lambda_{max} \sim n/(5K)$.
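
The H-spectrum can be sketched in a few lines under stated assumptions. The critical value is approximated here with the Wilson-Hilferty formula, which is a stand-in chosen to keep the example free of external libraries and is not stated in the text; an exact chi-square quantile could be substituted. The default significance level, the lowercase alphabet and all function names are likewise illustrative assumptions.

```python
from statistics import NormalDist

ALPHABET = "acgt"  # K = 4


def chi2_critical(alpha, dof):
    """Approximate upper-alpha chi-square quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3


def pearson_statistic(seq, period):
    """Normalized Pearson statistic at one test-period (compact form)."""
    k = len(ALPHABET)
    counts = [[0] * k for _ in range(period)]
    for pos, ch in enumerate(seq):
        counts[pos % period][ALPHABET.index(ch)] += 1
    pi = [[c / sum(row) for c in row] for row in counts]
    p = [sum(row[i] for row in pi) / period for i in range(k)]
    r = len(seq) / period
    return r * sum(
        (row[i] - p[i]) ** 2 / p[i] for row in pi for i in range(k) if p[i] > 0
    )


def h_spectrum(seq, max_period, alpha=1e-6):
    """Map each test-period (from 2 upward) to statistic / critical value."""
    k = len(ALPHABET)
    spectrum = {}
    for period in range(2, max_period + 1):  # period 1 has zero degrees of freedom
        dof = (k - 1) * (period - 1)
        spectrum[period] = pearson_statistic(seq, period) / chi2_critical(alpha, dof)
    return spectrum


# Illustrative input: a perfect repeat with a 7-letter pattern.
spectrum = h_spectrum("acgtacg" * 40, max_period=20)
heterogeneous_periods = [lam for lam, h in spectrum.items() if h > 1]
```

In this toy input the values above 1 appear at test-period 7 and its multiple 14, while a test-period such as 5 gives a value close to zero, which mirrors how significant heterogeneity is read off the H-spectrum.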

A plot of the function H (H-spectrum), which is the
spectrum of heterogeneity manifestation in the analyzed
sequence, clearly shows significant sequence heterogeneities
at those test-periods where $H(\lambda) > 1$. These
test-periods form a spectrum of heterogeneity
structure for the sequence, which is further
analyzed with the aid of an additional spectral-statistical
characteristic.

At every test-period $\lambda$ of the analyzed sequence, an additional
parameter is calculated from the sample $\lambda$-profile matrix
$\Pi = (\pi_j^i)$:

$$pl(\lambda) = \frac{1}{\lambda} \sum_{j=1}^{\lambda} \max\{\pi_j^i : i = 1, \ldots, K\} \qquad (5)$$

which is the character preservation level at test-period $\lambda$.

Figure 1. Spectral-statistical characteristics of DNA sequence in the HeteroGenome database. For a sequence on chromosome IV
(3239317-3239812 bp) of A. thaliana, spectral-statistical characteristics reveal a periodicity pattern size of 7 bp. Spectrum of heterogeneity
manifestation [H-spectrum, see Equation (4)] is shown above. Spectrum of character preservation level [pl-spectrum, see Equation (5)] is shown below. If the
spectra are too long, they may be viewed in parts. The button 'Define new range' is used to specify the desired part.



Therefore, for the considered sequence, the spectrum of
heterogeneity structure is analyzed together with the spectrum of
character preservation level (pl-spectrum). The test-period
from the spectrum of heterogeneity structure at which the
pl-spectrum first reaches its maximal value
is considered to be an estimate of the periodicity
pattern size for the periodicity region (see Figure 1). This
maximal value in the pl-spectrum may be interpreted as an
average index of preservation for copies of the estimated
periodicity pattern.
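
The period-selection rule described above can be sketched as follows. The code assumes that the candidate test-periods (those showing significant heterogeneity) have already been determined and are passed in as a list, and that "first maximal value" means the smallest candidate at which the preservation level attains its maximum; `preservation_level` and `estimate_period` are hypothetical names used only for this illustration.

```python
ALPHABET = "acgt"


def preservation_level(seq, period):
    """Character preservation level: mean over period positions of the
    relative frequency of the most common letter at that position."""
    k = len(ALPHABET)
    counts = [[0] * k for _ in range(period)]
    for pos, ch in enumerate(seq):
        counts[pos % period][ALPHABET.index(ch)] += 1
    return sum(max(row) / sum(row) for row in counts) / period


def estimate_period(seq, candidate_periods):
    """Return (period, level) for the smallest candidate reaching the maximum level."""
    best_period, best_level = None, -1.0
    for period in sorted(candidate_periods):
        level = preservation_level(seq, period)
        if level > best_level + 1e-12:  # strict improvement keeps the first maximum
            best_period, best_level = period, level
    return best_period, best_level


# Illustrative input: a perfect 7-letter repeat; 14 and 21 are multiples of the period.
example = "acgtacg" * 40
estimate = estimate_period(example, [7, 14, 21])
```

Multiples of the true period reach the same preservation level, so taking the first maximum is what keeps the estimate at the pattern size itself rather than at one of its multiples.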

The problem of reliability in locating approximate 
tandem repeats under the condition of small sample size 
(when the number of pattern copies in the repeat is suffi- 
ciently small) in the case of the SS-approach is resolved 
using the stochastic model of heterogeneity manifestation 
in textual strings (32). This model allows us to use add- 
itional statistical tests to check a hypothesis of heterogen- 
eity presence in a DNA sequence. 

From an algorithmic point of view, determining latent
periodicity is difficult if a period is a priori unknown. We
therefore searched for latent periodicity by identifying
significant heterogeneities (at level $\alpha = 10^{-6}$) in overlapping
windows of various sizes, in analysis of multiple-scanned
DNA sequences with variable steps. The initial window
size was 30 bp. The size of each subsequent window was
double that of the previous window, to a maximum of
4000 bp. Thus, the broad strategy used by the program,
applying the SS-approach (32) to search for latent periodicity
regions, resembled a 'shotgun strategy' for sequencing
the genomes (36). Under this strategy, relatively short and
overlapping DNA fragments were first sequenced and then
assembled by the program into longer regions. Similarly, initial
data on regions of significant heterogeneity found in the
genomes of the organisms passed through a number of
specific procedures to enable additional processing and
obtain well-defined borders for the located regions of
heterogeneity.
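
The window schedule and the assembly of overlapping hits can be sketched as below. This is only an illustration of the general idea: the text does not say whether a final window capped at exactly 4000 bp is used, how the scanning step is chosen, or how borders are refined, so the simple doubling rule and the interval-merging helper (`window_sizes`, `merge_intervals`, both hypothetical names) are assumptions.

```python
def window_sizes(initial=30, maximum=4000):
    """Window sizes obtained by repeated doubling, not exceeding the maximum."""
    sizes = []
    size = initial
    while size <= maximum:
        sizes.append(size)
        size *= 2
    return sizes


def merge_intervals(intervals):
    """Merge overlapping (start, end) hits into longer regions."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


sizes = window_sizes()
regions = merge_intervals([(10, 40), (30, 90), (200, 260)])
```

With these defaults the doubling rule yields eight window sizes, the largest being 3840 bp, and the three illustrative hits collapse into two regions.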

Particularities of SS-approach 

Let us compare the SS-approach used in the present work
with the already mentioned ID-method (27, 28), on the
basis of which the MMsat database (29) and the web-server
LEPSCAN (30) were created. To estimate the period of
latent periodicity, the ID-method exploits Z-statistics relying
on the information statistic introduced earlier by
Kullback (31). As mentioned above, the information
statistic, along with Z-statistics, can be used only for
revealing heterogeneity in an analyzed sequence.

Pearson statistics [see Equation (2)] are used at a preliminary
stage in the SS-approach to reveal heterogeneity in an
analyzed sequence. Heterogeneity can also be manifested
at test-periods shorter than the existing latent period. To
estimate the length of the period of latent periodicity, a spectrum
is introduced for another original statistic, pl [see Equation
(5)], called the 'character preservation level at test-period'. This
statistic, in contrast to Pearson statistics (32, 37), information
statistics (31) and Z-statistics (28), remains
representative up to a test-period equal to one
half of the analyzed sequence length, because in the positions
of each test-period it checks whether the characters match or
mismatch, independently of their kind. Hence, such a
test can be considered as a binomial scheme. So, if a high
level (~0.5 or more) of character preservation is determined
for an analyzed DNA sequence, its proximity to a divergent
tandem repeat is naturally supposed. This
supposition is used as the main criterion in selecting the
sequences for the HeteroGenome database. It should additionally
be mentioned that if a high level of character
preservation is revealed at the test-period $\lambda$, the corresponding
$\lambda$-profile matrix can be considered as representative for the
test-period, which was studied in (32). Consequently,
Pearson statistics and the original statistic of character
preservation level at test-period are complementary.
This allows sequences that are close to approximate
tandem repeats to be selected for the HeteroGenome database.
A special visual presentation of a sequence in
HeteroGenome helps to define the borders of an
approximate tandem repeat more exactly.

Figure 2. Spectrum of Z-statistics is shown for the DNA sequence from the MMsat database (AAU92263.1). The considered sequence is a fragment of
an original sequence from GenBank (AAU92263, indices: 61-388 bp).

Figure 3. The spectra of the SS-approach for DNA sequence from the MMsat database (AAU92263.1). (A) Spectrum of heterogeneity manifestation
[see Equation (4)]. (B) Spectrum of character preservation level [see Equation (5)].

Let us give an example of a DNA sequence from the
MMsat database (29) that was analyzed with the help
of the SS-approach. The spectrum of Z-statistics obtained from
MMsat for this sequence is shown in Figure 2. Based
on the maximal peak in the Z-statistics spectrum, it was
concluded that the sequence contains latent periodicity
with a period of 54 bp and unknown structure. Considering
the pl-spectrum (see Figure 3B), one can see that the
maximal value of the character preservation level for this
sequence is achieved at the test-period $\lambda = 162$ bp, where
$pl(\lambda) = 0.85$. Therefore, the analyzed DNA sequence is an
approximate tandem repeat (see Figure 4). This example
demonstrates that, in the case of high values of the pl-spectrum,
the upper limit for permitted test-periods in the
H-spectrum can be significantly enlarged, as was done
here, up to half of the sequence length (see Figure 3A).

1 50

ggaccctagagaacgaatataggactcgagaggaagactcagaggacgaa

ggagctcagagaaggaatataggactttagagaaagactcagaagacgaa
ggac

51 100

tatgagaccctagaagacgaatataggactcgagaggacgaatatgaaac
tatgggagcccagagagcgaatataggacccgagaggacgaatatggaac

101 150

cctagaagaagataaatatgggatcctagaggacgaatatgaaaccctag
tctagaggaagactcagaatacgaatatgggagcctgggggaagactcag

151 162
aagacgaatacg
aggacaaatatg

Figure 4. Alignment of DNA sequence from the MMsat database
(AAU92263.1) is shown according to the estimate of 162 bp for latent
period, which was obtained with the help of SS-approach. Matching
characters in the positions of the tandem repeat of the two copies are
shown in red.

It should be noted that in the MMsat database (29)
the period estimate is limited to 100 bp.
Use of the character preservation level statistic in
the HeteroGenome database allows this bound to be extended
up to 2000 bp and even more.

It should also be mentioned that the spectrum of Z-statistics
(28) is not reproducible by other researchers. This is because
the Monte Carlo method (28) is used at a
fixed test-period for the Z-statistics calculation, and the
way of choosing random elements for the matrices is not
specified. Furthermore, at long test-periods (~50 bp and
more), the number of trials needed for reliable Monte Carlo
results is too large (even for a powerful
computer). Again, in the general case, when the number of
repeats for a test-period is small, the initial matrix used in
Z-statistics calculations becomes statistically unstable. So,
in general, for long test-periods or for a small number of
their repeats, usage of Z-statistics becomes problematic.

In earlier works (33, 34), methods were elaborated for
recognizing a new type of latent periodicity, called latent profile
periodicity (latent profility). This type of
latent periodicity expands on the notion of approximate
tandem repeat. It appears that the HeteroGenome database
has collected many DNA sequences in which latent profile
periodicity is recognized. To assess the possibility of
revealing this latent periodicity with the help of the web-server
LEPSCAN (30), a number of DNA sequences in which
profile periodicity with a period of <20 bp had been
recognized were selected, in accordance with the upper period
bound allowed for search by LEPSCAN. No latent
periodicity was found by this web-server in the selected
sequences. Let us give a particular example of two such
DNA sequences from HeteroGenome. These sequences
are from A. thaliana chromosome I (11 780 828-11 780 960 bp
and 11 983 025-11 983 293 bp). In the first sequence,
latent profile periodicity with a period of
4 bp has been recognized, and in the second one, with a
period of 11 bp.

Results 

Applying the SS-approach (26, 32) enabled us to develop a 
complex program that was able to reliably reveal latent 
periodicity regions of differing types (approximate tandem 
repeats, fuzzy repeats, profile periodicity). The level of pat- 
tern copy divergence in tandem repeats was limited to 
50%. The program revealed equally well microsatellites
with patterns in the range of 2-10 bp and minisatellites
with patterns of ~10-100 bp, as well as megasatellites with
longer patterns up to 2000 bp. The minimum copy number
in the revealed approximate tandem repeats and fuzzy 
repeats was two. 

Along with highly divergent tandem repeats, regions of 
the new type of latent periodicity known as latent profile 
periodicity (33, 34) were also revealed. 

Reliable (at significance level $\alpha = 10^{-6}$) heterogeneity
regions, mainly approximate tandem repeats, revealed in 
the entire genomes of S. cerevisiae, A. thaliana, C. elegans 
and D. melanogaster were collected in the HeteroGenome 
database (www.jcbi.ru/lp_baze). 

Description of the database 

HeteroGenome is a relational database managed by
MySQL. For user convenience, the data search is organized
across all fields of the database (see Figure 5) with the
option of sorting data by field (Location, Region Length,
Period, Exponent, Preservation Level). A detailed description
of all the fields and their acceptable values is given in
separate windows accessible by clicking on the fields'
names. A user manual with examples demonstrating
how to work with the database is available on the site. Search
results can be downloaded as a plain text file. Because
of the two-level structure of each record (see Figure 6), 
described in the next section, the database interface offers 
two modes of information query: nonredundant, in which 
the sequences of the first level are searched; and simple, in 
which all sequences in the database are searched. 

Figure 5. Query form and data output. Output is shown of all reliable heterogeneity (latent periodicity) regions on chromosome I of A. thaliana
corresponding to the first, nonredundant, level of the HeteroGenome records.

Figure 6. Two-level structure of a record in HeteroGenome. DNA sequences with the revealed latent periodicity of 7 bp on chromosome IV of A. thaliana
constitute a group in which the two levels of structure are distinguished. A sequence of the greatest length, representing the entire group, is
placed at the first level. The remaining sequences making up the parts of the representative sequence correspond to internal representative regions
of clear periodicity structure and form the second level (INTRINSIC HETEROGENEITIES) in the group. A graphical schema of the group is shown at
the bottom. The sequences at the first level from all groups constitute nonredundant data in HeteroGenome.

For each sequence in HeteroGenome, separate windows
show the user: (i) H- and pl-spectrum [see Equations (4) 
and (5) and Figure 1], based on which an estimate of pat- 
tern periodicity size is proposed; and (ii) the analyzed 
DNA sequence presented as a column of consecutive se- 
quence segments with lengths equal to the pattern size esti- 
mate (see Figure 7). With information obtained from the 
spectral visual analysis, the user has the option of changing 
the segment length to more exactly estimate the pattern 
size for the approximate tandem repeat. Moreover, by following
a link to the Sequence Viewer
(http://www.ncbi.nlm.nih.gov/projects/sviewer/) graphical interface, the user
is able to obtain information on the functional context of
chromosomal location data.

Two levels of data presentation 

For the presentation of nonredundant data in 
HeteroGenome, we developed a logical unit record as 
a group of DNA sequences with reliable heterogeneity 
(latent periodicity), which intersect or, as a rule, have the 
same or multiple period lengths. Two levels of data presen- 
tation were distinguished in the group. At the first level, 
the longest DNA sequence, which is representative of the 
group, is considered. The remaining sequences in the 
group belong to the second level. Generally, these are well- 
determined local periodicity fragments of the group repre- 
sentative. An example of such a record, showing the 
two-level organization, can be seen in Figure 6. 

In addition to the groups containing DNA sequences of 
both levels (the group representative and elements of its 
internal heterogeneity) in HeteroGenome, other groups 
contain only a single DNA sequence representative of the 
group. Individual groups do not intersect. Thus, sequences 
representing the groups form nonredundant chromosome 
coverage by regions of reliable heterogeneity (latent 
periodicity). 
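
A minimal sketch of this two-level organization is given below. It assumes that regions are grouped purely by intersection of their coordinates (the additional criterion of equal or multiple period lengths is ignored), that coordinates are 1-based and inclusive, and that second-level regions lie inside the group representative; the function names, the record layout and the example coordinates are hypothetical and are not taken from the database.

```python
def build_groups(regions):
    """Group intersecting (start, end, period) regions into two-level records."""
    groups = []
    for region in sorted(regions, key=lambda r: (r[0], -r[1])):
        if groups and region[0] <= groups[-1]["span_end"]:
            # Intersects the current group: add it as a member.
            groups[-1]["members"].append(region)
            groups[-1]["span_end"] = max(groups[-1]["span_end"], region[1])
        else:
            groups.append({"members": [region], "span_end": region[1]})
    records = []
    for group in groups:
        # First level: the longest sequence represents the group.
        representative = max(group["members"], key=lambda r: r[1] - r[0])
        # Second level: the remaining sequences of the group.
        intrinsic = [r for r in group["members"] if r is not representative]
        records.append({"representative": representative, "intrinsic": intrinsic})
    return records


def nonredundant_coverage(records):
    """Total length (bp) covered by first-level sequences only."""
    return sum(r["representative"][1] - r["representative"][0] + 1 for r in records)


# Hypothetical regions: two intersecting ones and one isolated one.
records = build_groups([(1000, 1495, 7), (1174, 1305, 7), (5000, 5120, 10)])
covered = nonredundant_coverage(records)
```

Counting only the first-level sequences is what makes a coverage figure nonredundant: the nested second-level fragment in the first group does not add to the total.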

HeteroGenome can be searched for certain information
on latent periodicity (with a given period length, periodicity
region size, character preservation level pl for pattern
copies and so on), as described in the following sections,
at the first (nonredundant) level only, or at the first and
second levels together.

Additional sequence analysis 

Period length and coordinates for any region of latent peri- 
odicity can be determined in HeteroGenome with greater 
accuracy by using the results of visual analysis of the spec- 
tral parameters (see Figure 1), by subdividing the sequence 
into segments of estimated period length (see Figure 7), 
and by providing a length for flanking region sequences. 
An example of such analysis is described in the User man- 
ual on the database site. 
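
The segmentation view with flanking regions can be imitated with a short sketch. It assumes 1-based inclusive region coordinates on a chromosome string; `segment_view` and the toy sequence are hypothetical and only illustrate the idea of stacking consecutive segments of the estimated period length so that conserved positions line up in columns.

```python
def segment_view(chrom_seq, start, end, period, flank=0):
    """Split region [start, end] (1-based, inclusive) into rows of `period`
    letters and return (left flank, rows, right flank)."""
    left = chrom_seq[max(0, start - 1 - flank):start - 1]
    region = chrom_seq[start - 1:end]
    right = chrom_seq[end:end + flank]
    rows = [region[i:i + period] for i in range(0, len(region), period)]
    return left, rows, right


# Toy chromosome: a 7-letter pattern repeated three times between short flanks.
toy = "tttt" + "acgtacg" * 3 + "gggg"
left, rows, right = segment_view(toy, start=5, end=25, period=7, flank=3)
```

Trying a different `period` or shifting `start` by a few positions changes how well the columns line up, which is the kind of visual check the interface offers for refining a period estimate and region borders.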

Figure 7. DNA sequence view in HeteroGenome. Consecutive segmentation
of a DNA sequence on chromosome IV (3239491-3239622 bp) of
A. thaliana is shown corresponding to the revealed latent periodicity
pattern size $\lambda = 7$ bp (see Figure 6, the sequence parameters are shown
in red). Left and right flanking regions of length $l = 21$ bp are also
shown. Segment size $\lambda$ and length $l$ of flanking region can be redefined.
Pressing the 'Change' button generates corresponding sequence
segmentation in a new window.

Additional analysis can lead to a more precise reading 
of the data because the data specification and data distri- 
bution across the groups are generated automatically. 
After studying group content, the user can correct the 
group coordinates or even consider the group as consisting 
of several subgroups. 



Analysis of latent periodicity in HeteroGenome 

A comparison of data for the genomes of S. cerevisiae, 
A. thaliana, C. elegans and D. melanogaster in HeteroGe- 
nome and TRDB (24) shows that HeteroGenome in fact 
contains all the repeats presented in TRDB and supple- 
ments them with data on highly divergent tandem repeats. 

HeteroGenome also indicates regions of the new type 
of periodicity called latent profile periodicity or profility 
(33, 34) in the genome. The notion of latent profility 
expands on the notion of the tandem repeat. A pattern of 
latent profility consists of independent random characters 
with a corresponding probability distribution for letters of 
the DNA alphabet. This type of periodicity requires future 
research. Figure 8 shows an example of a sequence with 
a profility of period length 10 bp, obtained from 
HeteroGenome. 
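
The idea of a pattern made of per-position letter distributions can be sketched as follows. This is only an illustration of a positional frequency profile for a given period; it is not the profility test of (33, 34), and the name `position_profile` is an assumption.

```python
from collections import Counter


def position_profile(sequence, period):
    """Estimate letter frequencies at each of the `period` pattern positions.

    Characters other than A, C, G and T are ignored. Hypothetical helper.
    """
    counts = [Counter() for _ in range(period)]
    for i, base in enumerate(sequence.upper()):
        if base in "ACGT":
            counts[i % period][base] += 1
    profile = []
    for column in counts:
        total = sum(column.values())
        profile.append(
            {b: (column[b] / total if total else 0.0) for b in "ACGT"}
        )
    return profile


# Toy example with period 2: position 0 is always A, position 1 varies.
profile = position_profile("ACAGACAT", 2)
for pos, freqs in enumerate(profile):
    print(pos, freqs)
```

In a perfect tandem repeat every position would carry a single letter with frequency 1; in a profile pattern the positions differ in their distributions without any one consensus letter having to dominate.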

It is interesting to note that TRDB (24) contains ~40 dif- 
ferent tandem repeats for the fragment shown in Figure 8, 

Figure 8. Fragment of a sequence on chromosome I (1 661 021-1 663 249 bp) of C. elegans in HeteroGenome for which a latent profility of 10 bp was
revealed. For the displayed fragment, pl(10) = 0.77.



Table 2. Proportion of regions with reliable heterogeneity
(latent periodicity) in analyzed genomes

Species            Genome length, bp   Total length, bp   Percentage of genome
S. cerevisiae MT              85 779             20 892                   24.4
S. cerevisiae             12 070 900            419 909                    3.5
A. thaliana              119 146 348          4 247 672                    3.6
C. elegans               100 269 917          6 692 629                    6.7
D. melanogaster          120 381 546          5 108 483                    4.2



but only one of them has a pattern size of 10 bp (5.4 pat- 
tern copies). Nevertheless, there is a clear structural regu- 
larity of 10 bp for the fragment. The presence of a latent 
profile periodicity of 10 bp in the sequence has been veri- 
fied by the methods described in (33, 34). 



Genome coverage by latent periodicity 

An important quantitative index used to study the evolution
and functional meaning of latent periodicity regions in
genomes is the amount of genome coverage by these regions.
Nonredundant data on regions of reliable heterogeneity
(latent periodicity) enable a sufficiently precise
estimation of the percentage of latent periodicity regions
(including highly divergent tandem repeats and profility) in
analyzed genomes. Table 2 presents these estimates.
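
The coverage estimate amounts to simple arithmetic, sketched below. The helper names are assumptions. `merged_length` shows how overlapping records would be collapsed if the input were redundant; at the first level of HeteroGenome the groups do not intersect, so merging changes nothing there. The two percentages are recomputed from the totals in Table 2.

```python
def merged_length(regions):
    """Total number of bases covered by 1-based inclusive (start, end) regions.

    Overlapping or adjacent regions are counted once. Hypothetical helper.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(regions):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def coverage_percent(covered_bp, genome_bp):
    """Share of the genome covered, as a percentage."""
    return 100.0 * covered_bp / genome_bp


# Redundant toy records: the first two overlap and are counted once.
print(merged_length([(1, 10), (5, 20), (30, 40)]))

# Totals taken from Table 2.
print(round(coverage_percent(20_892, 85_779), 1))       # S. cerevisiae MT
print(round(coverage_percent(419_909, 12_070_900), 1))  # S. cerevisiae
```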

A high percentage (24.4%) of latent periodicity regions 
was revealed in the yeast S. cerevisiae mitochondrial gen- 
ome. As Figure 9 indicates, more than half of the regions 
(13.3%) are highly divergent microsatellites. 

The majority of latent periodicity regions in the nuclear
genomes of the analyzed organisms consist of micro- and
minisatellites (period length <100 bp). In the human genome,
the proportion of micro- and minisatellites is commonly
estimated to be 3% (36). Together with those repeats with
periods that do not exceed 2000 bp, the proportion of tandem
repeats is ~10% (19). Taking into account the data in Table 2,
we may suppose that the periodicity, shown by tandem repeats
with periods <2000 bp, varies in eukaryotic genomes to within
10%. This percentage may be conditioned by a balance between
a molecular mechanism for the origin of tandem repeats
and their divergence, which thus stabilizes the repeat
length.

Latent periodicity impact on chromosome length 

Let us consider how the proportion of latent periodicity re- 
gions (reliable heterogeneity regions shown by tandem re- 
peats) depends on the length of the chromosomes of the 
analyzed organisms (see Figure 10). Specific dispersion as a 
percentage of latent periodicity regions covering single 

Figure 9. Histogram of structural content for latent periodicity regions
in mitochondrial genome of S. cerevisiae. Corresponding to revealed
period L, for micro- (2 < L < 10), mini- (10 < L < 100) and megasatellites
(100 < L < 2000), coverage (as a percentage) of the mitochondrial genome
by latent periodicity regions with various preservation levels [pl(L),
see Equation (5)] is shown separately. The preservation level in range
0.4 < pl < 0.7 (red) corresponds to highly divergent tandem repeats; that
in range 0.7 < pl < 0.8 (green) corresponds to moderately divergent
tandem repeats; that in 0.8 < pl < 0.9 (yellow) corresponds to slightly
divergent tandem repeats; and that in 0.9 < pl < 1.0 (blue) corresponds to
perfect tandem repeats.

chromosomes was observed for each model organism. The
highest dispersion (4.95%) between maximal (8.91% on
chromosome I) and minimal (3.96% on chromosome X)
coverage was observed in the C. elegans genome. In the
genome of D. melanogaster, this dispersion is approximately
one-third less (3.11%), ranging from 3.30%
(chromosome IV) to 6.41% (chromosome X). For both
organisms, the amount of dispersion is caused by the
particular position of the X chromosome. Moreover, in
C. elegans the X chromosome has the lowest coverage,
whereas in D. melanogaster the X chromosome has
the highest coverage of all chromosomes in the genome.

Dispersion of chromosome coverage by latent period- 
icity regions for the D. melanogaster genome is compar- 
able with that for S. cerevisiae. Dispersion for S. cerevisiae 
(3.57%) ranges from 2.7% (chromosome XVI) to 6.27% 
(chromosome I). Note that chromosome I, which has the 
highest periodicity percentage in the S. cerevisiae genome, 
is the shortest chromosome of the yeast. 
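
The dispersion quoted here is the difference between the largest and the smallest per-chromosome coverage. A minimal sketch, using only the extreme values given in the text (the intermediate chromosomes are omitted because their values are not listed in this section; the function name is an assumption):

```python
def coverage_dispersion(coverage_by_chromosome):
    """Range (max minus min) of per-chromosome coverage percentages."""
    values = list(coverage_by_chromosome.values())
    return max(values) - min(values)


# Extreme values quoted in the text; other chromosomes lie between them.
extremes = {
    "C. elegans": {"I": 8.91, "X": 3.96},
    "D. melanogaster": {"IV": 3.30, "X": 6.41},
    "S. cerevisiae": {"XVI": 2.7, "I": 6.27},
}
for species, coverage in extremes.items():
    print(species, round(coverage_dispersion(coverage), 2))
```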

Although in the genomes of S. cerevisiae, C. elegans and
D. melanogaster the dispersion of the latent periodicity
percentage across chromosomes is comparable with the average
percentage of periodicity in each genome, in the genome
of A. thaliana this dispersion does not exceed 0.75%. As
we can see in Figure 10B for Arabidopsis, the proportion of
latent periodicity regions remains almost constant as
chromosome length grows.

In fact, for all analyzed genomes of the model organisms,
with increasing chromosome length the proportion of
latent periodicity regions is generally fixed or decreases.
We may suppose that, because of their instability and ability
to elongate during DNA replication, the tandem repeats
moderately (~10%) influenced the length of chromosomes
of the model organisms.

The most probable mechanism repressing the elongation
of periodicity regions is a divergence that is sufficiently
rapid and prevents slippage of DNA strands during
chromosome replication.

Structure preservation in latent periodicity regions 

Figure 11 shows histograms of the qualitative structural 
content of the revealed periodicity regions for all chromo- 
somes in the genomes of the model organisms. Percentages 
of chromosome length occupied by highly divergent, mod- 
erately divergent, slightly divergent and perfect tandem 
repeats are shown separately for the micro-, mini- and 
megasatellites. 
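
The two classifications used in the histograms can be written down as a small sketch. The interval boundaries in the figure captions are given with strict inequalities, so the treatment of the boundary values here (half-open intervals, with L = 2000 and pl = 1.0 included in the last class) is an assumption, as are the function names.

```python
def period_class(period):
    """Class of a periodicity region by period length L (bp)."""
    if 2 <= period < 10:
        return "microsatellite"
    if 10 <= period < 100:
        return "minisatellite"
    if 100 <= period <= 2000:
        return "megasatellite"
    return None  # outside the range analyzed in the database


def divergence_class(pl):
    """Class of a periodicity region by preservation level pl."""
    if 0.4 <= pl < 0.7:
        return "highly divergent"
    if 0.7 <= pl < 0.8:
        return "moderately divergent"
    if 0.8 <= pl < 0.9:
        return "slightly divergent"
    if 0.9 <= pl <= 1.0:
        return "perfect"
    return None  # below the lowest level shown in the histograms


# Period and preservation level of the fragment in Figure 8.
print(period_class(10), "/", divergence_class(0.77))
```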

As can be seen from Figure 11A, latent periodicity regions
in the S. cerevisiae genome are generally indicated by
highly divergent sequences of microsatellites, which make
up ~2% of regions in the genome. Highly divergent
minisatellites make up <1% of the genome. Except for
chromosomes I and IX, megasatellite repeats with period lengths
>100 bp are absent in the yeast genome.

Some chromosomes in the genome of S. cerevisiae have 
similar histograms for all three considered groups of satel- 
lites, such as chromosomes VIII and XII, chromosomes 
XIII, VI and XIV and chromosomes XI, XV and VII. 

For the plant A. thaliana (Figure 11B) and the nematode
C. elegans (Figure 11C), whose genomes are an order of
magnitude longer than that of the yeast S. cerevisiae
(Figure 11A), the histograms show similar tendencies
in terms of the qualitative makeup of the tandem repeats in
the genomes.

Firstly, in the genomes of A. thaliana and C. elegans,
highly divergent minisatellites account for ~1-1.5%,
which is a perceptible proportion and is similar to their
percentages of microsatellites. Therefore, in A. thaliana
and C. elegans, mini- and microsatellites contribute
similarly to structural and functional genome organization.
Secondly, megasatellite repeats, which are almost
completely absent in the yeast genome and, as described
below, are numerically insignificant in the Drosophila
genome (Figure 11D), are sufficiently important in the
genomes of Arabidopsis and the nematode, where they make
up ~1% of regions.

Neglecting some particularities for chromosomes II and
III, the histograms for A. thaliana chromosomes are similar
(see Figure 11B). In fact, except for the chromosome X and
III histograms, the remaining histograms for C. elegans
(see Figure 11C) are identical.

Figure 10. Coverage of chromosomes of analyzed organisms by regions of reliable heterogeneity (latent periodicity). For each organism, S. cerevisiae
(A), A. thaliana (B), C. elegans (C) and D. melanogaster (D), chromosomes are ordered by ascending length, as shown in the respective images on the
right. Solid straight lines show trends. Percentage of latent periodicity regions on each chromosome is determined at the nonredundant level of
records in HeteroGenome. See the text and Figure 6 for details.



Histograms produced using the results of structural
analysis of latent periodicity regions revealed on the
chromosomes of fruit fly D. melanogaster are shown
in Figure 11D. It can be seen that highly divergent
(~1.5-2%) and moderately divergent microsatellites
(~0.5-1%) are dominant in the Drosophila genome. Note
the similarities between histograms in Figure 11D, with the
exception of those for chromosomes 4 and X.

Figure 11. Structural content for latent periodicity regions in the genomes of S. cerevisiae (A), A. thaliana (B), C. elegans (C) and D. melanogaster (D).
Corresponding to revealed period L, for micro- (2 < L < 10), mini- (10 < L < 100) and megasatellites (100 < L < 2000), coverage (as a percentage) of
genome by periodicity regions with various preservation levels [pl(L), see Equation (5)] is shown as separate histograms. The preservation level in
range 0.4 < pl < 0.7 (red) corresponds to highly divergent tandem repeats; that in range 0.7 < pl < 0.8 (green) corresponds to moderately divergent
tandem repeats; that in 0.8 < pl < 0.9 (yellow) corresponds to slightly divergent tandem repeats; and that in 0.9 < pl < 1.0 (blue) corresponds to perfect
tandem repeats. The order of chromosomes in the figure was determined by visual similarities between their histograms.



Similarities between histograms reflecting the qualita- 
tive structural content of latent periodicity regions on 
the single chromosomes of the analyzed organisms are not 
evidence of a common evolutionary origin of the chromo- 
somes; however, they do allow us to propose a similar 
mechanism for the evolutionary pressure and divergence 
under which the chromosomes were formed. 

Moreover, as indicated in Figure 11, in each genome we 
can differentiate one or two dominant characteristic types 
of latent periodicity such as highly divergent microsatellites 
in the S. cerevisiae genome. The genomes of A. thaliana 
and C. elegans have similar percentages of characteristic 
types of latent periodicity (highly divergent micro- and 
minisatellites ~1.5%). 

Determining the functional role of dominant periodici- 
ties in a genome is likely to be a topic of future research. 



Latent periodicity in functional regions 

Via a hyperlink to the Sequence Viewer graphical interface 
(http://www.ncbi.nlm.nih.gov/projects/sviewer/), for any 
DNA region in HeteroGenome, information can be ob- 
tained about the region intersection with annotated se- 
quences of the investigated genome. Table 3 presents a 
general overview of HeteroGenome data distribution over 
functional (annotated) and functionally unassigned, usu- 
ally noncoding, DNA sequences of genomes. 

To estimate the distribution of the HeteroGenome
groups over annotated regions of a genome, a fairly relaxed
criterion was used. If the intersection between a group and
a functional region was not less than 50% of the shorter of
the two sequences, the group was assigned to that
region. While estimating the groups' distribution over
annotated regions, alternative splicing was not taken

Table 3. HeteroGenome groups' distribution over functional regions in the genomes

GenBank features      S. cerevisiae^a   A. thaliana    C. elegans    D. melanogaster
Gene                  [illegible]       [illegible]    25 551        [illegible]
mRNA                                    [illegible]    12 667        [illegible]
[illegible]                             [illegible]    12 046
Intron^b                                3105           14 145        28 112
[illegible]                                            -             21
STS                                     141            -             150
tRNA                                    [illegible]    6             [illegible]
rRNA                                    [illegible]                  [illegible]
ncRNA                                   3
misc_RNA              4
rep_origin            15
repeat_region         95                                             1951
LTR                   11
Unassigned            738               12 935         13 773        22 851
HeteroGenome Groups   4094              34 566         39 329        72 772

^a There are no mRNA features in the annotation for S. cerevisiae genome.
^b The data on the introns were obtained from attribute line 'join' for 'mRNA' ('CDS' for S. cerevisiae) feature description.










Table 4. The whole number of the nucleotides from the HeteroGenome groups over gene regions in the genomes

                   Nucleotides in the groups/nucleotides in genome
Species            Genes                          Exons*                         Introns*
S. cerevisiae      354 459/8 829 668 (4%)         352 781/8 737 430 (4%)         448/64 756 (0.7%)
A. thaliana        2 037 728/70 751 773 (3%)      1 601 059/40 167 753 (4%)      296 614/19 355 644 (1.5%)
C. elegans         3 816 463/60 377 579 (6.3%)    1 479 515/27 267 246 (5.4%)    402 554/32 373 603 (10%)
D. melanogaster    3 786 123/79 344 223 (4.8%)    3 410 523/28 271 547 (12%)     4 419 548/49 121 309 (9%)

Note: The share of groups' nucleotides in the total length of corresponding functional regions in genome is shown in parentheses.

*The data on the exons and introns were obtained from attribute line 'join' for 'mRNA' feature description ('CDS' feature description for S. cerevisiae).



into account, i.e. for one gene only one mRNA and Coding 
DNA Sequence (CDS) were considered. 
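
The assignment criterion can be sketched in a few lines. This reads "50% of the minor sequence" as 50% of the shorter of the two intervals, which is an interpretation; the function names and coordinates are hypothetical, and intervals are 1-based and inclusive.

```python
def overlap_length(a, b):
    """Number of shared positions of two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def assigned_to_region(group, feature, min_fraction=0.5):
    """True if the overlap is at least min_fraction of the shorter interval."""
    shorter = min(group[1] - group[0] + 1, feature[1] - feature[0] + 1)
    return overlap_length(group, feature) >= min_fraction * shorter


group = (100, 199)                                  # 100 bp group
print(assigned_to_region(group, (150, 1000)))       # 50 bp shared
print(assigned_to_region(group, (160, 1000)))       # 40 bp shared
print(assigned_to_region(group, (120, 129)))        # short feature inside group
```

Measuring against the shorter interval means that a short annotated feature lying wholly inside a long periodicity region still counts as a hit, which is what makes the criterion relaxed.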

As one can see from Table 3, according to the criterion
introduced above, 80, 62, 65 and 67% of the HeteroGenome
groups are located in the genes of S. cerevisiae,
A. thaliana, C. elegans and D. melanogaster, respectively.
In the same order of organisms, 18, 37.4, 35 and 31.4% of
the groups are located in unannotated (unassigned) regions
of their genomes. In general, distribution over other
functional regions, besides the genes, is random and
insignificant. However, it should be mentioned that 2.6% of
the HeteroGenome groups are located in various repeats of
D. melanogaster.

Groups located in the genes have a high probability of
being assigned to both an intron and an exon, because
many groups cross the borders between them. So, additional
analysis was done to calculate the sums of the nucleotides
(in the HeteroGenome groups) occurring in the
introns and exons separately. Table 4 presents the results
of such an analysis, in relation to the total number of
nucleotides occurring in functional regions of the whole genome.
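
Counting nucleotides rather than whole groups avoids the border problem, because a group that straddles an exon-intron boundary contributes to each side only the bases that fall there. A minimal sketch with hypothetical coordinates and function names, assuming the features of one kind do not overlap each other:

```python
def bases_in_features(groups, features):
    """Sum of group bases that fall inside the given features.

    Intervals are 1-based inclusive (start, end) pairs. Features of one
    kind are assumed not to overlap each other. Hypothetical helper.
    """
    return sum(
        max(0, min(g_end, f_end) - max(g_start, f_start) + 1)
        for g_start, g_end in groups
        for f_start, f_end in features
    )


def share_percent(group_bases, feature_bases):
    """Share of the features' total length occupied by groups, in percent."""
    return 100.0 * group_bases / feature_bases


# One group crossing an exon-intron border (toy coordinates).
groups = [(90, 130)]
exons = [(100, 120), (200, 220)]
introns = [(121, 199)]

in_exons = bases_in_features(groups, exons)
in_introns = bases_in_features(groups, introns)
print(in_exons, in_introns)

# A share recomputed from Table 4: S. cerevisiae introns.
print(round(share_percent(448, 64_756), 1))
```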



Comparing Tables 2 and 4, one can see that the shares of
nucleotides located in the genes are, for all organisms in
HeteroGenome, practically similar to the shares of genome
coverage by the regions of latent periodicity. The
distribution of nucleotides over exons and introns depends on
the organism. Thus, for example, the percentage of the
whole length of latent periodicity regions in the exons of
D. melanogaster (12%) is higher than in the introns (9%),
despite the fact that the introns are nearly twice as long
as the exons. On the contrary, with the whole lengths of the
exons and introns being similar in the C. elegans genome,
the share of latent periodicity regions in the introns is twice
as high as in the exons (10 and 5.4%, respectively).
Hence, no direct dependence is observed between
the whole length of latent periodicity regions and genome
length, or the total length of functional regions.

Density distribution of latent periodicity regions over the 
chromosomes 

The study of density distribution of latent periodicity re- 
gions over the chromosomes was done for all analyzed 

Figure 12. An example of the histograms showing density distributions for latent periodicity regions revealed on the chromosomes. The height of
a histogram bar equals the part of the whole percentage of latent periodicity regions on the chromosome that is associated with the sequential fragment
of length equal to the histogram's step (see text for details). (A-C) Histograms show the distributions for chromosome I from the genomes of
S. cerevisiae, A. thaliana and C. elegans, respectively. (D) Histogram is shown for chromosome IV from the genome of D. melanogaster.



genomes in the work. Each chromosome was divided into
consecutive fragments of the same size, each accounting for
0.5% of the whole chromosome length. This size is considered
the interval (step) of the division. Then the sum of the
lengths (in nucleotides) of latent periodicity regions located
within the fragment's borders was calculated for each
fragment. This sum, normalized by the length of the whole
chromosome and multiplied by 100%, was considered the
part of the whole percentage of latent periodicity regions
on the chromosome that is assigned to the fragment.
Summing all parts in the division gives an estimate of the
general percentage of latent periodicity on the
chromosome. Distributions of numerical values for such
shares over four chromosomes from the organisms considered
in the work are shown in Figure 12. In investigating the
density distribution of latent periodicity regions, only the
sequences that represent the groups were taken into account,
as they give a nonredundant estimate of chromosome
coverage by latent periodicity.
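
The procedure just described can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name is hypothetical, regions are 1-based inclusive and nonredundant, the step is rounded to a whole number of bases, and a region that crosses a fragment border is split between the neighboring fragments.

```python
import math


def density_distribution(regions, chromosome_length, step_fraction=0.005):
    """Per-fragment shares (in %) of chromosome coverage by the regions.

    The shares sum to the overall coverage percentage of the chromosome.
    """
    step = max(1, round(chromosome_length * step_fraction))
    n_bins = math.ceil(chromosome_length / step)
    shares = [0.0] * n_bins
    for start, end in regions:
        lo, hi = start - 1, min(end, chromosome_length)  # 0-based, half-open
        for b in range(lo // step, (hi - 1) // step + 1):
            overlap = min(hi, (b + 1) * step) - max(lo, b * step)
            shares[b] += 100.0 * overlap / chromosome_length
    return shares


# Toy chromosome of 10 000 bp: the step is 50 bp and there are 200 fragments.
toy_regions = [(1, 50), (9976, 10000)]
shares = density_distribution(toy_regions, 10_000)
print(len(shares), shares[0], shares[-1], sum(shares))
```

Because each share is normalized by the whole chromosome length rather than by the fragment length, the bar heights add up directly to the coverage figure reported for that chromosome.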



Histograms in Figure 12 demonstrate the density
distribution for the regions of latent periodicity over
chromosome I from the genomes of the yeast S. cerevisiae (A),
the plant A. thaliana (B) and the nematode C. elegans (C),
and also over chromosome 4 from the genome of the fruit
fly D. melanogaster (D). The corresponding steps of the
divisions for these chromosomes are 1151, 152 138,
75 362 and 6759 bp. Results for all chromosomes of the
considered organisms are presented on the Statistics page
of the HeteroGenome database (http://www.jcbi.ru/lp_baze/).

Moreover, for each chromosome three additional
density distributions were obtained for three different
classes of periodicity in the genome, i.e. micro-, mini- and
megasatellites with period lengths in the intervals
2 < L < 10, 10 < L < 100 and 100 < L < 2000, respectively.
Figure 13 shows an example of such distributions for
chromosome I from the C. elegans genome (see Figure 12C
for their combined distribution).

Figure 13. Histograms of density distribution for the regions of latent periodicity revealed on chromosome I of C. elegans genome are shown for the
three classes of periodicity: (A) micro-, (B) mini- and (C) megasatellites.



As demonstrated by the examples in Figures 12
and 13, the density distribution of latent periodicity
over the chromosomes unambiguously characterizes each
chromosome in the genomes. Such a distribution may be
considered a kind of DNA fingerprint or bar code
for every chromosome in the genomes of various
organisms.

Conclusions 

As a result of applying the SS-approach to revealing
heterogeneity (latent periodicity) regions in the genomes of model
organisms (S. cerevisiae, A. thaliana, C. elegans and
D. melanogaster), reliable data were obtained for tandem
repeats, including highly divergent repeats and regions of a
new type of latent periodicity called profility (33, 34) (see
Figure 8 for example), from the HeteroGenome database.

Because of its user-friendly interface and options for
additional data analysis, HeteroGenome may be useful
both in searching for new markers in molecular genetic
investigations of organisms and for advanced research into
latent periodicity in DNA sequences. The first efforts to
study the latent periodicity phenomenon in genomes with the
help of the HeteroGenome database were made in (38).

Latent profility may be located with the aid of the
methods described in (33, 34) for a number of regions where
0.4 < pl < 0.7 [see Equation (5) and Figure 1]. A specially
developed two-level structure of records in HeteroGenome
enables the data to be presented without redundancy and
indicates conservative fragments in the regions of latent
periodicity.

Taking into consideration the HeteroGenome data for
the genomes of S. cerevisiae, A. thaliana, C. elegans and
D. melanogaster, and the results of other research (19, 36),
it may be that latent periodicity regions constitute ~10%
of regions in the genomes of various organisms. Highly
divergent microsatellite repeats (with period lengths
<10 bp), amounting to ~2% of the entire genome length,
are dominant in all of the aforementioned organisms. As
we have described, the qualitative and quantitative content
of regions of latent periodicity is characteristic of the
genomes. For example (see Figure 11), in the genome of yeast
S. cerevisiae (except for chromosomes I and IX) and in that
of fruit fly D. melanogaster, megasatellite repeats (with
period lengths >100 bp) are almost completely absent. In
contrast, megasatellite repeats in the genome of plant
A. thaliana and in that of nematode C. elegans are
sufficiently noticeable and account for ~1% of the genomes.

Analysis of the HeteroGenome data distribution over
functional (annotated in GenBank) and nonfunctional
(unassigned) genome DNA sequences shows that the share
of latent periodicity in the genes is similar to the general
genome coverage by the regions of latent periodicity.
Nevertheless, the regions' distribution between the exons
and the introns depends on the organism. Thus, the percentage
of the whole length of latent periodicity regions in the
exons of D. melanogaster (12%) is higher than in the
introns (9%), though the introns are nearly twice as long
as the exons. On the contrary, while the whole
lengths of exons and introns are similar in the C. elegans
genome, the share of latent periodicity regions in the introns is
twice as high as in the exons (10 and 5.4%,
respectively). Therefore, no direct dependence is
observed between the length of latent periodicity regions
and the total length of the genome or of its functional regions.

It should be noted that general distribution of latent 
periodicity regions over the chromosomes in genome is a 
unique characteristic for each of the considered model 
organisms. 